Abstract

Models based on deep convolutional networks have dominated recent
image interpretation tasks; we investigate whether models which are
also recurrent, or “temporally deep”, are effective for tasks
involving sequences, visual and otherwise. We develop a novel
recurrent convolutional architecture suitable for large-scale visual
learning which is end-to-end trainable, and demonstrate the value of
these models on benchmark video recognition tasks, image description
and retrieval problems, and video narration challenges. In contrast to
current models which assume a fixed spatio-temporal receptive field or
simple temporal averaging for sequential processing, recurrent
convolutional models are “doubly deep” in that they can be
compositional in spatial and temporal “layers”. Such models may have
advantages when target concepts are complex and/or training data are
limited. Learning long-term dependencies is possible when
nonlinearities are incorporated into the network state updates.
Long-term RNN models are appealing in that they can directly map
variable-length inputs (e.g., video frames) to variable-length outputs
(e.g., natural language text) and can model complex temporal dynamics;
yet they can be optimized with backpropagation. Our recurrent
long-term models are directly connected to modern visual convnet
models and can be jointly trained to simultaneously learn temporal
dynamics and convolutional perceptual representations. Our results
show such models have distinct advantages over state-of-the-art models
for recognition or generation which are separately defined and/or
optimized.


1.  Introduction 

Recognition and description of images and videos is a fundamental
challenge of computer vision. Dramatic progress has been achieved by
supervised convolutional models on image recognition tasks, and a
number of extensions to process video have been recently proposed.
Ideally, a video model should allow processing of variable-length
input sequences, and also provide for variable-length outputs,
including generation of full-length sentence descriptions that go
beyond conventional one-versus-all prediction tasks. In this paper we
propose long-term recurrent convolutional networks (LRCNs), a novel
architecture for visual recognition and description which combines
convolutional layers and long-range temporal recursion and is
end-to-end trainable (see Figure 1). We instantiate our architecture
for specific video activity recognition, image caption generation, and
video description tasks as described below.

Figure 1: We propose Long-term Recurrent Convolutional Networks
(LRCNs), a class of architectures leveraging the strengths of rapid
progress in CNNs for visual recognition problems, and the growing
desire to apply such models to time-varying inputs and outputs. LRCN
processes the (possibly) variable-length visual input (left) with a
CNN (middle-left), whose outputs are fed into a stack of recurrent
sequence models (LSTMs, middle-right), which finally produce a
variable-length prediction (right).

To date, CNN models for video processing have successfully considered
learning of 3-D spatio-temporal filters over raw sequence data
[14, 2], and learning of frame-to-frame representations which
incorporate instantaneous optic flow or trajectory-based models
aggregated over fixed windows or video shot segments [17, 32]. Such
models explore two extrema of perceptual time-series representation
learning: either learn a fully general time-varying weighting, or
apply simple temporal pooling. Following the same inspiration that
motivates current deep convolutional models, we advocate for video
recognition and description models which are also deep over temporal
dimensions; i.e., have temporal recurrence of latent variables. RNN
models are well known to be “deep in time”; e.g., explicitly so when
unrolled, and form implicit compositional representations in the time
domain. Such “deep” models predated deep spatial convolution models in
the literature [30, 4.  ].

Recurrent Neural Networks have been explored in perceptual
applications for many decades, with varying results. A significant
limitation of simple RNN models which strictly integrate state
information over time is known as the “vanishing gradient” effect: the
ability to backpropagate an error signal through a long-range temporal
interval becomes increasingly difficult in practice. A class of models
which enable long-range learning was first proposed in [12], and
augments hidden state with nonlinear mechanisms to cause state to
propagate without modification, be updated, or be reset, using simple
memory-cell like neural gates. While this model proved useful for
several tasks, its utility became apparent in recent results reporting
large-scale learning of speech recognition [1  ] and language
translation models [37, 5].

We show here that long-term recurrent convolutional models are
generally applicable to visual time-series modeling; we argue that in
visual tasks where static or flat temporal models have previously been
employed, long-term RNNs can provide significant improvement when
ample training data are available to learn or refine the
representation. Specifically, we show LSTM-type models provide for
improved recognition on conventional video activity challenges and
enable a novel end-to-end optimizable mapping from image pixels to
sentence-level natural language descriptions. We also show that these
models improve generation of descriptions from intermediate visual
representations derived from conventional visual models.

We instantiate our proposed architecture in three experimental
settings (see Figure 3). First, we show that by directly connecting a
visual convolutional model to deep LSTM networks, we are able to train
video recognition models that capture complex temporal state
dependencies (Figure 3 left; Section 4). While existing labeled video
activity datasets may not have actions or activities with extremely
complex time dynamics, we nonetheless see improvements on the order of
4% on conventional benchmarks.

Second, we explore direct end-to-end trainable image to sentence
mappings. Strong results for machine translation tasks have recently
been reported [37, 5]; such models are encoder/decoder pairs based on
LSTM networks. We propose a multimodal analog of this model, and
describe an architecture which uses a visual convnet to encode a deep
state vector, and an LSTM to decode the vector into a natural language
string (Figure 3 middle; Section 5). The resulting model can be
trained end-to-end on large-scale image and text datasets, and even
with modest training provides competitive generation results compared
to existing methods.

Finally, we show that LSTM decoders can be driven directly from
conventional computer vision methods which predict higher-level
discriminative labels, such as the semantic video role tuple
predictors in [29] (Figure 3 right; Section 6). While not end-to-end
trainable, such models offer architectural and performance advantages
over previous statistical machine translation-based approaches, as
reported below.

We have realized a generalized “LSTM”-style RNN model in the
widely-adopted open source deep learning framework Caffe [15],
incorporating the specific LSTM units of [45, 37, 5].

2.  Background:  Recurrent  Neural  Networks  (RNNs)

Traditional RNNs (Figure 2, left) can learn complex temporal dynamics
by mapping input sequences to a sequence of hidden states, and hidden
states to outputs via the following recurrence equations:

$$h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$z_t = g(W_{hz} h_t + b_z)$$

where $g$ is an element-wise non-linearity, such as a sigmoid or
hyperbolic tangent, $x_t$ is the input, $h_t \in \mathbb{R}^N$ is the
hidden state with $N$ hidden units, and $z_t$ is the output at time
$t$. For a length $T$ input sequence $(x_1, x_2, \ldots, x_T)$, the
updates above are computed sequentially as $h_1$ (letting $h_0 = 0$),
$z_1$, $h_2$, $z_2$, ..., $h_T$, $z_T$.
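The recurrence above can be illustrated with a minimal NumPy sketch. The function name `rnn_forward`, the toy dimensions, and the random weights are assumptions made for illustration only; they are not taken from the paper's Caffe implementation.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, W_hz, b_z, g=np.tanh):
    """Run the vanilla RNN recurrence over a sequence xs of shape (T, D)."""
    h = np.zeros(W_hh.shape[0])              # h_0 = 0
    hs, zs = [], []
    for x_t in xs:                           # timesteps are processed in order
        h = g(W_xh @ x_t + W_hh @ h + b_h)   # h_t
        z = g(W_hz @ h + b_z)                # z_t
        hs.append(h)
        zs.append(z)
    return np.stack(hs), np.stack(zs)

# Toy sizes (illustrative only): D inputs, N hidden units, K outputs.
rng = np.random.default_rng(0)
T, D, N, K = 5, 3, 4, 2
params = (rng.normal(size=(N, D)), rng.normal(size=(N, N)), np.zeros(N),
          rng.normal(size=(K, N)), np.zeros(K))
hs, zs = rnn_forward(rng.normal(size=(T, D)), *params)
print(hs.shape, zs.shape)
```

Because each $h_t$ depends on $h_{t-1}$, the loop cannot be parallelized over time; only the work inside a single timestep can.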

Though RNNs have proven successful on tasks such as speech recognition
[41] and text generation [36], it can be difficult to train them to
learn long-term dynamics, likely due in part to the vanishing and
exploding gradients problem [12] that can result from propagating the
gradients down through the many layers of the recurrent network, each
corresponding to a particular timestep. LSTMs provide a solution by
incorporating memory units that allow the network to learn when to
forget previous hidden states and when to update hidden states given
new information.


As research on LSTMs has progressed, hidden units with varying
connections within the memory unit have been proposed. We use the LSTM
unit as described in [44] (Figure 2, right), which is a slight
simplification of the one described in [10]. Letting
$\sigma(x) = (1 + e^{-x})^{-1}$ be the sigmoid nonlinearity which
squashes real-valued inputs to a $[0, 1]$ range, and letting
$\phi(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$ be
the hyperbolic tangent nonlinearity, similarly squashing its inputs to
a $[-1, 1]$ range, the LSTM updates for timestep $t$ given inputs
$x_t$, $h_{t-1}$, and $c_{t-1}$ are:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$

$$g_t = \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \phi(c_t)$$


Figure 2: A diagram of a basic RNN cell (left) and an LSTM memory cell
(right) used in this paper (from [44], a slight simplification of the
architecture described in [9], which was derived from the LSTM
initially proposed in [  ]).

In addition to a hidden unit $h_t \in \mathbb{R}^N$, the LSTM includes
an input gate $i_t \in \mathbb{R}^N$, forget gate
$f_t \in \mathbb{R}^N$, output gate $o_t \in \mathbb{R}^N$, input
modulation gate $g_t \in \mathbb{R}^N$, and memory cell
$c_t \in \mathbb{R}^N$. The memory cell unit $c_t$ is a summation of
two things: the previous memory cell unit $c_{t-1}$, which is
modulated by $f_t$, and $g_t$, a function of the current input and
previous hidden state, modulated by the input gate $i_t$. Because
$i_t$ and $f_t$ are sigmoidal, their values lie within the range
$[0, 1]$, and $i_t$ and $f_t$ can be thought of as knobs that the LSTM
learns to use to selectively forget its previous memory or consider
its current input. Likewise, the output gate $o_t$ learns how much of
the memory cell to transfer to the hidden state. These additional
cells enable the LSTM to learn extremely complex and long-term
temporal dynamics that the RNN is not capable of learning. Additional
depth can be added to LSTMs by stacking them on top of each other,
using the hidden state of the LSTM in layer $l - 1$ as the input to
the LSTM in layer $l$.
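As a rough illustration of the gate updates and of stacking, the sketch below implements one LSTM step and a multi-layer forward pass in NumPy. Packing the four weight blocks into single matrices `W_x` and `W_h`, the names `lstm_step` and `stacked_lstm_forward`, and the toy sizes are all assumptions chosen for brevity, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM update. W_x: (4N, D), W_h: (4N, N), b: (4N,), with rows
    stacked in the order [input gate, forget gate, output gate, modulation]."""
    N = h_prev.shape[0]
    a = W_x @ x_t + W_h @ h_prev + b
    i_t = sigmoid(a[0:N])            # input gate
    f_t = sigmoid(a[N:2 * N])        # forget gate
    o_t = sigmoid(a[2 * N:3 * N])    # output gate
    g_t = np.tanh(a[3 * N:4 * N])    # input modulation gate
    c_t = f_t * c_prev + i_t * g_t   # memory cell
    h_t = o_t * np.tanh(c_t)         # hidden state
    return h_t, c_t

def stacked_lstm_forward(xs, layers, N):
    """layers: list of (W_x, W_h, b); layer l reads the hidden state of layer l-1."""
    states = [(np.zeros(N), np.zeros(N)) for _ in layers]
    outputs = []
    for x_t in xs:
        inp = x_t
        for l, (W_x, W_h, b) in enumerate(layers):
            h, c = lstm_step(inp, states[l][0], states[l][1], W_x, W_h, b)
            states[l] = (h, c)
            inp = h                  # feeds the next layer up
        outputs.append(inp)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, N = 6, 3, 4
layers = [
    (rng.normal(size=(4 * N, D)), rng.normal(size=(4 * N, N)), np.zeros(4 * N)),
    (rng.normal(size=(4 * N, N)), rng.normal(size=(4 * N, N)), np.zeros(4 * N)),
]
top = stacked_lstm_forward(rng.normal(size=(T, D)), layers, N)
print(top.shape)
```

Since $o_t$ lies in $(0, 1)$ and $\phi(c_t)$ in $(-1, 1)$, every entry of the hidden state stays strictly inside $(-1, 1)$, whatever the weights.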

Recently, LSTMs have achieved impressive results on language tasks
such as speech recognition [10] and machine translation [37, 5].
Analogous to CNNs, LSTMs are attractive because they allow end-to-end
fine-tuning. For example, [10] eliminates the need for complex
multi-step pipelines in speech recognition by training a deep
bidirectional LSTM which maps spectrogram inputs to text. Even with no
language model or pronunciation dictionary, the model produces
convincing text translations. [37] and [:  ] translate sentences from
English to French with a multilayer LSTM encoder and decoder.
Sentences in the source language are mapped to a hidden state using an
encoding LSTM, and then a decoding LSTM maps the hidden state to a
sequence in the target language. Such an encoder-decoder scheme allows
sequences of different lengths to be mapped to each other. Like [10],
the sequence-to-sequence architecture for machine translation
circumvents the need for language models.

The advantages of LSTMs for modeling sequential data in vision
problems are twofold. First, when integrated with current vision
systems, LSTM models are straightforward to fine-tune end-to-end.
Second, LSTMs are not confined to fixed-length inputs or outputs,
allowing simple modeling for sequential data of varying lengths, such
as text or video. We next describe a unified framework to combine
LSTMs with deep convolutional networks to create a model which is both
spatially and temporally deep.

3.  Long-term  Recurrent  Convolutional  Network  (LRCN)  model

This work proposes a Long-term Recurrent Convolutional Network (LRCN)
model combining a deep hierarchical visual feature extractor (such as
a CNN) with a model that can learn to recognize and synthesize
temporal dynamics for tasks involving sequential data (inputs or
outputs), visual, linguistic or otherwise. Figure 1 depicts the core
of our approach. Our LRCN model works by passing each visual input
$v_t$ (an image in isolation, or a frame from a video) through a
feature transformation $\phi_V(v_t)$ parametrized by $V$ to produce a
fixed-length vector representation $\phi_t \in \mathbb{R}^d$. Having
computed the feature-space representation of the visual input sequence
$(\phi_1, \phi_2, \ldots, \phi_T)$, the sequence model then takes
over.

In its most general form, a sequence model parametrized by $W$ maps an
input $x_t$ and a previous timestep hidden state $h_{t-1}$ to an
output $z_t$ and updated hidden state $h_t$. Therefore, inference must
be run sequentially (i.e., from top to bottom, in the Sequence
Learning box of Figure 1), by computing in order:
$h_1 = f_W(x_1, h_0) = f_W(x_1, 0)$, then $h_2 = f_W(x_2, h_1)$, etc.,
up to $h_T$. Some of our models stack multiple LSTMs atop one another
as described in Section 2.

The final step in predicting a distribution $P(y_t)$ at timestep $t$
is to take a softmax over the outputs $z_t$ of the sequential model,
producing a distribution over the (in our case, finite and discrete)
space $C$ of possible per-timestep outputs:

$$P(y_t = c) = \frac{\exp(W_{zc} z_{t,c} + b_c)}{\sum_{c' \in C} \exp(W_{zc} z_{t,c'} + b_c)}$$
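For intuition, the normalization step can be sketched as a standard softmax over a vector of per-class scores. This sketch assumes the affine terms in the equation have already been folded into the scores, and the helper name `softmax` and the example values are illustrative only.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Subtracting the maximum score does not change the result, since the common factor cancels between numerator and denominator, but it avoids overflow for large scores.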

The success of recent very deep models for object recognition
[23, 33, 38] suggests that strategically composing many “layers” of
non-linear functions can result in very powerful models for perceptual
problems. For large $T$, the above recurrence indicates that the last
few predictions from a recurrent network with $T$ timesteps are
computed by a very “deep” ($T$-layered) non-linear function,
suggesting that the resulting recurrent model may have similar
representational power to a $T$-layer neural network. Critically,
however, the sequential model’s weights $W$ are reused at every
timestep, forcing the model to learn generic timestep-to-timestep
dynamics (as opposed to dynamics directly conditioned on $t$, the
sequence index) and preventing the parameter size from growing in
proportion to the maximum number of timesteps.

In most of our experiments, the visual feature transformation $\phi$
corresponds to the activations in some layer of a large CNN. Using a
visual transformation $\phi_V(\cdot)$ which is time-invariant and
independent at each timestep has the important advantage of making the
expensive convolutional inference and training parallelizable over all
timesteps of the input, facilitating the use of fast contemporary CNN
implementations whose efficiency relies on independent batch
processing, and end-to-end optimization of the visual and sequential
model parameters $V$ and $W$.
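One way to picture this parallelism is to fold the time axis into the batch axis, run the feature extractor once, and unfold the result. In the sketch below, `toy_features` is a hypothetical stand-in for $\phi_V$ (a single ReLU layer rather than a CNN), and all shapes are illustrative assumptions.

```python
import numpy as np

def toy_features(frames, V):
    """Stand-in for phi_V: flattens each frame and applies one ReLU layer.
    frames: (B, H, W, C) -> (B, d). A real model would be a CNN."""
    flat = frames.reshape(frames.shape[0], -1)
    return np.maximum(flat @ V, 0.0)

def features_over_time(video, V):
    """video: (T, B, H, W, C). Fold time into batch, run once, unfold."""
    T, B = video.shape[:2]
    merged = video.reshape((T * B,) + video.shape[2:])
    feats = toy_features(merged, V)
    return feats.reshape(T, B, -1)

rng = np.random.default_rng(0)
T, B, H, W, C, d = 3, 2, 4, 4, 3, 5
video = rng.normal(size=(T, B, H, W, C))
V = rng.normal(size=(H * W * C, d))
feats = features_over_time(video, V)
per_step = np.stack([toy_features(video[t], V) for t in range(T)])
print(feats.shape, np.allclose(feats, per_step))
```

The single batched call gives the same features as looping over timesteps, which is only possible because the extractor shares its parameters across time and keeps no state between frames.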

We consider three vision problems (activity recognition, image
description and video description) which instantiate one of the
following broad classes of sequential learning tasks:

1.  Sequential inputs, fixed outputs (Figure 3, left):
$(x_1, x_2, \ldots, x_T) \mapsto y$. The visual activity recognition
problem can fall under this umbrella, with videos of arbitrary length
$T$ as input, but with the goal of predicting a single label like
running or jumping drawn from a fixed vocabulary.

2.  Fixed inputs, sequential outputs (Figure 3, middle):
$x \mapsto (y_1, y_2, \ldots, y_T)$. The image description problem
fits in this category, with a non-time-varying image as input, but a
much larger and richer label space consisting of sentences of any
length.

3.  Sequential inputs and outputs (Figure 3, right):
$(x_1, x_2, \ldots, x_T) \mapsto (y_1, y_2, \ldots, y_{T'})$. Finally,
it’s easy to imagine tasks for which both the visual input and output
are time-varying, and in general the number of input and output
timesteps may differ (i.e., we may have $T \neq T'$). In the video
description task, for example, the input and output are both
sequential, and the number of frames in the video should not constrain
the length of (number of words in) the natural-language description.

In the previously described formulation, each instance has $T$ inputs
$(x_1, x_2, \ldots, x_T)$ and $T$ outputs $(y_1, y_2, \ldots, y_T)$.
We describe how we adapt this formulation in our hybrid model to
tackle each of the above three problem settings. With sequential
inputs and scalar outputs, we take a late fusion approach to merging
the per-timestep predictions $(y_1, y_2, \ldots, y_T)$ into a single
prediction $y$ for the full sequence. With fixed-size inputs and
sequential outputs, we simply duplicate the input $x$ at all $T$
timesteps, $x_t := x$ (noting this can be done cheaply due to the
time-invariant visual feature extractor). Finally, for a
sequence-to-sequence problem with (in general) different input and
output lengths, we take an “encoder-decoder” approach inspired by
[A  ]. In this approach, one sequence model, the encoder, is used to
map the input sequence to a fixed-length vector, then another sequence
model, the decoder, is used to unroll this vector to sequential
outputs of arbitrary length. Under this model, the system as a whole
may be thought of as having $T + T'$ timesteps of input and output,
wherein the input is processed and the decoder outputs are ignored for
the first $T$ timesteps, and the predictions are made and “dummy”
inputs are ignored for the latter $T'$ timesteps.
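The first two adaptations are simple enough to sketch directly. The snippet below assumes late fusion is done by averaging per-timestep class distributions (the choice used for activity recognition in Section 4); the function names and toy values are illustrative and not part of the original system.

```python
import numpy as np

def late_fusion(step_probs):
    """(T, C) per-timestep class distributions -> one label, by averaging."""
    return int(np.argmax(step_probs.mean(axis=0)))

def repeat_static_input(feature, T):
    """Copy one image feature vector (d,) to every timestep -> (T, d)."""
    return np.tile(feature, (T, 1))

step_probs = np.array([[0.6, 0.4],
                       [0.2, 0.8],
                       [0.3, 0.7]])
label = late_fusion(step_probs)
tiled = repeat_static_input(np.array([1.0, 2.0, 3.0]), T=4)
print(label, tiled.shape)
```

Duplicating the input is cheap in practice because the feature vector only needs to be computed once per image and can then be reused at every timestep.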

Under the proposed system, the weights $(V, W)$ of the model’s visual
and sequential components can be learned jointly by maximizing the
likelihood of the ground truth outputs $y_t$ conditioned on the input
data and labels up to that point $(x_{1:t}, y_{1:t-1})$. In
particular, we minimize the negative log likelihood
$\mathcal{L}(V, W) = -\log P_{V,W}(y_t \mid x_{1:t}, y_{1:t-1})$ of
the training data $(x, y)$.
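A small numeric sketch of this objective follows. It assumes the per-timestep terms are summed over the sequence, which the text does not state explicitly; `sequence_nll` and the toy distributions are hypothetical.

```python
import math

def sequence_nll(step_probs, targets):
    """Sum over timesteps of -log P(y_t | ...), given one predicted
    distribution per timestep and the index of the ground truth output."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, targets))

step_probs = [[0.5, 0.5], [0.25, 0.75]]   # predicted distributions for t = 1, 2
loss = sequence_nll(step_probs, [0, 1])   # ground truth outputs y_1 = 0, y_2 = 1
print(loss)
```

The loss is zero only when the model assigns probability 1 to every ground truth output, and it grows without bound as any of those probabilities approaches 0.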

One of the most appealing aspects of the described system is the
ability to learn the parameters “end-to-end,” such that the parameters
$V$ of the visual feature extractor learn to pick out the aspects of
the visual input that are relevant to the sequential classification
problem. We train our LRCN models using stochastic gradient descent
with momentum, with backpropagation used to compute the gradient
$\nabla \mathcal{L}(V, W)$ of the objective $\mathcal{L}$ with respect
to all parameters $(V, W)$.
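As a reminder of what a momentum update looks like, here is a minimal sketch on a one-parameter quadratic. The learning rate, momentum coefficient, and the classical (heavy-ball) form of the update are assumptions; the text does not give the hyperparameters or the exact variant used.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy objective f(w) = w^2 with gradient 2w, standing in for the real loss.
params, velocity = [1.0], [0.0]
for _ in range(200):
    grads = [2.0 * params[0]]
    params, velocity = sgd_momentum_step(params, grads, velocity)
print(params[0])
```

In the full model the gradient would come from backpropagation through both the sequence model and the CNN, but the parameter update itself has this same shape.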

We next demonstrate the power of models which are both deep in space
and deep in time by exploring three applications: activity
recognition, image description, and video description.

4.  Activity  recognition 

Activity recognition is an example of the first sequential learning
task described above; $T$ individual frames are inputs into $T$
convolutional networks which are then connected to a single-layer LSTM
with 256 hidden units. A large body of recent work has proposed deep
architectures for activity recognition ([17, 32, 14, 2,  ]).
[32, 1  ] both propose convolutional networks which learn filters
based on a stack of N input frames. Though we analyze clips of 16
frames in this work, we note that the LRCN system is more flexible
than [32, 17] since it is not constrained to analyzing fixed-length
inputs and could potentially learn to recognize complex video
sequences (e.g., cooking sequences as presented in Section 6). [1,2]
use recurrent neural networks to learn temporal dynamics of either
traditional vision features ([1]) or deep features ([  ]), but do not
train their models end-to-end and do not pre-train on larger object
recognition databases for important performance gains.

Figure 3: Task-specific instantiations of our LRCN model for activity
recognition, image description, and video description.

We explore two variants of the LRCN architecture: one in which the
LSTM is placed after the first fully connected layer of the CNN
(LRCN-fc6) and another in which the LSTM is placed after the second
fully connected layer of the CNN (LRCN-fc7). We train the LRCN
networks with video clips of 16 frames. The LRCN predicts the video
class at each time step and we average these predictions for final
classification. At test time, we extract 16-frame clips with a stride
of 8 frames from each video and average across clips.
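The test-time clip sampling can be sketched as follows. The handling of videos shorter than one clip and of trailing frames that do not fill a clip is not specified in the text; this sketch simply keeps every full clip that fits, and the helper names are hypothetical.

```python
def clip_starts(num_frames, clip_len=16, stride=8):
    """Start indices of every full clip that fits in the video."""
    return list(range(0, num_frames - clip_len + 1, stride))

def average_scores(score_lists):
    """Element-wise mean of equally long score lists (one list per clip)."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]

starts = clip_starts(40)                               # clips cover frames 0-15, 8-23, ...
video_scores = average_scores([[0.2, 0.8], [0.6, 0.4]])
print(starts, video_scores)
```

With a stride of half the clip length, consecutive clips overlap by 8 frames, so most frames contribute to two clip-level predictions.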

We also consider both RGB and flow inputs. Flow is computed with [4]
and transformed into a “flow image” by centering x and y flow values
around 128 and multiplying by a scalar such that flow values fall
between 0 and 255. A third channel for the flow image is created by
calculating the flow magnitude. The CNN base of the LRCN is a hybrid
of the Caffe [15] reference model, a minor variant of AlexNet [23],
and the network used by Zeiler & Fergus [46]. The net is pre-trained
on the 1.2M image ILSVRC-2012 [31] classification training subset of
the ImageNet [7] dataset, giving the network a strong initialization
to facilitate faster training and prevent over-fitting to the
relatively small video datasets. When classifying center crops, the
top-1 classification accuracy is 60.2% and 57.4% for the hybrid and
Caffe reference models, respectively. In our baseline model, $T$ video
frames are individually classified by a CNN. As in the LSTM model,
whole video classification is done by averaging scores across all
video frames.
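The flow image construction described above might look roughly like the sketch below. The clipping bound of 20 pixels, the resulting scale factor, and the way the magnitude channel is scaled are assumptions; the text only says that a scalar is chosen so that values fall between 0 and 255.

```python
import numpy as np

def flow_to_image(flow, bound=20.0):
    """flow: (H, W, 2) float array of x/y displacements in pixels.
    bound is an assumed clipping range; the text only mentions 'a scalar'."""
    scale = 127.0 / bound
    xy = np.clip(flow * scale + 128.0, 0, 255)                    # center around 128
    mag = np.clip(np.linalg.norm(flow, axis=2) * scale, 0, 255)   # third channel
    return np.dstack([xy, mag]).astype(np.uint8)

still = flow_to_image(np.zeros((2, 2, 2)))          # no motion
fast = flow_to_image(np.full((1, 1, 2), 100.0))     # motion beyond the bound
print(still[0, 0], fast[0, 0])
```

Encoding flow as a three-channel 8-bit image lets the same pre-trained CNN architecture consume either RGB frames or flow fields without structural changes.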

4.1.  Evaluation 

We evaluate our architecture on the UCF-101 dataset [35] which
consists of over 12,000 videos categorized into 101 human action
classes. The dataset is divided into three splits, with a little under
8,000 videos in the training set for each split. We report accuracy
for split 1.

Table 1, columns 2-3, compares video classification of our proposed
models (LRCN-fc6, LRCN-fc7) against the baseline architecture for both
RGB and flow inputs. Each LRCN network is trained end-to-end. To
determine if end-to-end training is necessary, we also train an
LRCN-fc6 network in which only the LSTM parameters are learned. The
fully fine-tuned network increases performance from 70.47% to 71.12%,
demonstrating that end-to-end fine-tuning is indeed beneficial. The
LRCN-fc6 network yields the best results for both RGB and flow and
improves upon the baseline network by 2.12% and 4.75% respectively.

RGB and flow networks can be combined by computing a weighted average
of network scores as proposed in [32]. Like [32], we report two
weighted averages of the predictions from the RGB and flow networks in
Table 1 (right). Since the flow network outperforms the RGB network,
weighting the flow network higher unsurprisingly leads to better
accuracy. In this case, LRCN outperforms the baseline single-frame
model by 3.88%.
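The score fusion can be sketched in a few lines. The two weightings are the ones reported in Table 1; the function name `fuse` and the example scores are made up to show a case where the weighting changes the predicted class.

```python
def fuse(rgb_scores, flow_scores, w_rgb, w_flow):
    """Weighted average of per-class scores from the RGB and flow networks."""
    return [w_rgb * r + w_flow * f for r, f in zip(rgb_scores, flow_scores)]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

rgb, flow = [0.70, 0.30], [0.35, 0.65]            # toy two-class scores
equal = fuse(rgb, flow, 1 / 2, 1 / 2)             # (1/2, 1/2) weighting
flow_heavy = fuse(rgb, flow, 1 / 3, 2 / 3)        # (1/3, 2/3) weighting
print(argmax(equal), argmax(flow_heavy))
```

In this toy case equal weighting follows the RGB network while the flow-heavy weighting follows the flow network, which is the mechanism behind the difference between the last two columns of Table 1.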

The LRCN shows clear improvement over the baseline single-frame system
and approaches the accuracy achieved by other deep models. [32] report
the results on UCF-101 by computing a weighted average between flow
and RGB networks (86.4% for split 1 and 87.6% averaging over all
splits). Though [17] does not report numbers on the separate splits of
UCF-101, the average split accuracy is 65.4% which is substantially
lower than our LRCN model.

                        Input Type        Weighted Average
Model                   RGB     Flow      1/2, 1/2    1/3, 2/3
Single frame            65.40   53.20     -           -
Single frame (ave.)     69.00   72.20     75.71       79.04
LRCN-fc6                71.12   76.95     81.97       82.92
LRCN-fc7                70.68   69.36     -           -

Table 1: Activity recognition: Comparing single frame models to LRCN
networks for activity recognition in the UCF-101 [3  ] dataset, with
both RGB and flow inputs. Our LRCN model consistently and strongly
outperforms a model based on predictions from the underlying
convolutional network architecture alone.

5.  Image  description 

In contrast to activity recognition, the static image description task
only requires a single convolutional network since the input consists
of a single image. A variety of deep and multi-modal models
[8, 34, 20, 21, 16, 26, 21, 1  ] have been proposed for image
description; in particular, [21, 19] combine deep temporal models with
convolutional representations. [21] utilizes a “vanilla” RNN as
described in Section 2, potentially making learning long-term temporal
dependencies difficult. Contemporaneous with and most similar to our
work is [19], which proposes a different architecture that uses the
hidden state of an LSTM encoder at time $T$ as the encoded
representation of the length $T$ input sequence. It then maps this
sequence representation, combined with the visual representation from
a convnet, into a joint space from which a separate decoder predicts
words. This is distinct from our arguably simpler architecture, which
takes as per-timestep input a copy of the static input image, along
with the previous word. We present empirical results showing that our
integrated LRCN architecture outperforms these prior approaches, none
of which comprise an end-to-end optimizable system over a hierarchy of
visual and temporal parameters.

We now describe our instantiation of the LRCN architecture for the
image description task. At each timestep, both the image features and
the previous word are provided as inputs to the sequential model, in
this case a stack of four LSTMs (each with 1000 hidden states), which
is used to learn the dynamics of the time-varying output sequence,
natural language. At timestep $t$, the input to the bottom-most LSTM
is the embedded ground truth word from the previous timestep
$w_{t-1}$. For sentence generation, the input becomes a sample
$\hat{w}_{t-1}$ from the model’s predicted distribution at the
previous timestep. The second LSTM in the stack fuses the outputs of
the bottom-most LSTM with the image representation $\phi_V(x)$ to
produce a joint representation of the visual and language inputs up to
time $t$. (The visual model $\phi_V(x)$ used in this experiment is the
base Caffe [15] reference model, very similar to the well-known
AlexNet [23], pre-trained on ILSVRC-2012 [31] as in Section 4.) The
third and fourth LSTMs transform the outputs of the LSTM below, and
the fourth LSTM’s outputs are inputs to the softmax which produces a
distribution over words $p(w_t \mid w_{1:t-1})$.
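The difference between training and generation lies only in where the previous word comes from. The sketch below shows the generation loop with a lookup table standing in for the LSTM stack and softmax; the start and end markers, the maximum length, and the `next_word_probs` interface are all assumptions made for illustration.

```python
import random

def generate(next_word_probs, bos="_BOS_", eos="_EOS_", max_len=20, seed=0):
    """Sample words one at a time, feeding each sample back as the next input.
    next_word_probs(prev_word) returns a dict mapping word -> probability; it
    stands in for the LSTM stack plus softmax, which would also see the image."""
    rng = random.Random(seed)
    words, prev = [], bos
    for _ in range(max_len):
        dist = next_word_probs(prev)
        vocab = list(dist)
        prev = rng.choices(vocab, weights=[dist[w] for w in vocab], k=1)[0]
        if prev == eos:
            break
        words.append(prev)
    return words

# Deterministic toy "model": each word has exactly one possible successor.
table = {"_BOS_": {"a": 1.0}, "a": {"dog": 1.0},
         "dog": {"runs": 1.0}, "runs": {"_EOS_": 1.0}}
caption = generate(lambda prev: table[prev])
print(caption)
```

During training the loop body would instead receive the ground truth word $w_{t-1}$ at each step, so no sampling is needed and all timesteps of a caption can be scored in one pass.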

Without any explicit language modeling or defined syntax structure,
the described LRCN system learns mappings from pixel intensity values
to natural language descriptions that are often semantically
descriptive and grammatically correct.

5.1.  Evaluation 

We evaluate our image description model on both image retrieval and
image annotation generation. We first show the effectiveness of our
model by quantitatively evaluating it on the image retrieval task as
seen in [26, 16, 34, 8, 19]. Our model is trained on the combined
training sets of the Flickr30k [13] (28,000 training images) and
COCO2014 [25] dataset (80,000 training images). We report results on
Flickr30k [13], with 30,000 images and five sentence annotations per
image. We use 1000 images each for test and validation and the
remaining 28,000 for training.

Image  retrieval  results  are  recorded  in  Table  2  and  re¬ 
port  median  rank,  Medr,  of  the  first  retrieved  ground  truth 
image  and  Recall  @K,  the  number  of  sentences  for  which 
the  correct  image  is  retrieved  in  the  top-K.  Our  model 
consistently  outperforms  the  strong  baselines  from  recent 
work  [19,  26,  16,  34,  ]  as  can  be  seen  in  Table  2.  Here, 
we  make  a  note  that  the  new  OxfordNet  model  in  [19]  out¬ 
performs  our  model  on  the  retrieval  task.  However,  Ox¬ 
fordNet  [19]  utilizes  a  better-performing  convolutional  net¬ 
work  to  get  the  additional  edge  over  the  base  ConvNet  [19]. 
The  strength  of  our  temporal  model  (and  integration  of  the 
temporal  and  visual  models)  can  be  more  directly  measured 
against  the  ConvNet  [19]  result,  which  uses  the  same  base 
CNN  architecture  [2  ]  pretrained  on  the  same  data. 
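
The two retrieval metrics can be sketched as follows. This is an illustrative computation rather than the evaluation code used for Table 2: the `ranks` values are hypothetical, and Recall@K is expressed as a percentage of queries, which is an assumption about how the table values are scaled.

```python
from statistics import median

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute Recall@K (in %) and the median rank.

    ranks[i] is the 1-based rank at which the ground-truth image
    was retrieved for caption query i.
    """
    n = len(ranks)
    recall = {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    return recall, median(ranks)

# Hypothetical ranks for five caption queries.
ranks = [1, 3, 12, 7, 2]
recall_at_k, medr = retrieval_metrics(ranks)
```

Higher Recall@K and lower Medr both indicate that the correct image tends to appear nearer the top of the ranked list.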

To  evaluate  sentence  generation,  we  use  the  BLEU  [27] 
metric  which  was  designed  for  automated  evaluation  of  sta¬ 
tistical  machine  translation.  BLEU  is  a  modified  form  of 
precision  that  compares  N-gram  fragments  of  the  hypothe¬ 
sis  translation  with  multiple  reference  translations.  We  use 
BLEU  as  a  measure  of  similarity  of  the  descriptions.  The 
unigram scores (B-1) account for the adequacy of (or the
information retained by) the translation, while longer N-gram
scores (B-2, B-3) account for the fluency. We compare
our results with [26] (on Flickr30k), and two strong
nearest neighbor baselines computed using AlexNet fc7 and


Method           R@1    R@5    R@10   Medr
DeViSE [8]       6.7    21.9   32.7   25
SDT-RNN [34]     8.9    29.8   41.1   16
DeFrag [1 ]      10.3   31.4   44.5   13
m-RNN [26]       12.6   31.2   41.5   16
ConvNet [19]     10.4   31.0   43.7   14
LRCN (ours)      14.0   34.9   47.0   11

Table 2: Image description: Caption-to-image retrieval results for
the Flickr30k [13] dataset. R@K is the average recall at rank K
(high is good). Medr is the median rank (low is good). Note that
[19] achieves better retrieval performance using a stronger CNN
architecture; see text.


Flickr30k [13]         B-1     B-2     B-3     B-4
m-RNN [26]             54.79   23.92   19.52   -
1NN fc8 base (ours)    37.34   18.66   9.39    4.88
1NN fc7 base (ours)    38.81   20.16   10.37   5.54
LRCN (ours)            58.72   39.06   25.12   16.46

COCO 2014 [25]         B-1     B-2     B-3     B-4
1NN fc8 base (ours)    46.04   26.20   14.95   8.70
1NN fc7 base (ours)    47.47   27.55   15.96   9.36
LRCN (ours)            62.79   44.19   30.41   21.00

Table 3: Image description: Sentence generation results (BLEU
scores (%) - ours are adjusted with the brevity penalty) for the
Flickr30k [13] and COCO 2014 [25] test sets.

fc8 layer outputs. We used 1-nearest neighbor to retrieve the
most similar image in the training database and averaged the
BLEU score over the captions. The results on Flickr30k are
reported  in  Table  3.  Additionally,  we  report  results  on  the 
new  COCO2014  [25]  dataset  which  has  80,000  training  im¬ 
ages,  and  40,000  validation  images.  Similar  to  Flickr30k, 
each  image  is  annotated  with  5  or  more  image  annotations. 
We  isolate  5,000  images  from  the  validation  set  for  testing 
purposes  and  the  results  are  reported  in  Table  3. 
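
The nearest neighbor baseline can be sketched as below. The text does not state which distance is used, so squared Euclidean distance is an assumption here, and the tiny feature vectors and captions are hypothetical stand-ins for fc7 or fc8 activations and training annotations.

```python
def nearest_neighbor_captions(query_feat, train_feats, train_captions):
    """Return the captions of the training image closest to query_feat."""
    def sq_dist(a, b):
        # Squared Euclidean distance (an assumed choice of metric).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(train_feats)),
               key=lambda i: sq_dist(query_feat, train_feats[i]))
    return train_captions[best]

# Hypothetical 3-dimensional features and training captions.
train_feats = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
train_captions = [["a dog runs"], ["a man surfs", "a surfer on a wave"]]

captions = nearest_neighbor_captions([0.9, 0.1, 0.0],
                                     train_feats, train_captions)
```

Each retrieved caption would then be scored against the test image's reference captions, and the scores averaged.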

Based on the B-1 scores in Table 3, generation using
LRCN performs comparably with m-RNN [26] in terms of
the information conveyed in the description. Furthermore,
LRCN significantly outperforms the baselines and the m-RNN
with regard to the fluency (B-2, B-3) of the generation,
indicating the LRCN retains more of the bigrams and
trigrams from the human-annotated descriptions.
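
To make the metric concrete, the following is a minimal sentence-level sketch of BLEU: clipped ("modified") N-gram precision against multiple references, combined by a geometric mean and scaled by the brevity penalty. It is not the scoring script used for Table 3; published BLEU numbers are typically aggregated over a whole test corpus, and the function names and example sentences here are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision of hyp against multiple references."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()  # max count of each n-gram in any single reference
    for ref in refs:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def bleu(hyp, refs, max_n):
    """Sentence-level BLEU with uniform weights and brevity penalty."""
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Reference length closest to the hypothesis length (ties: shorter).
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1.0 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical generated caption and two reference captions.
hyp = "a man riding a wave".split()
refs = ["a man riding a wave on a surfboard".split(),
        "a surfer rides a wave".split()]
score = bleu(hyp, refs, max_n=2)  # B-2 for this single sentence
```

Clipping prevents a hypothesis from being rewarded for repeating a word more often than any reference does, and the brevity penalty discourages very short outputs that would otherwise score high precision.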

In addition to standard quantitative evaluations, we also
employ Amazon Mechanical Turk (AMT) workers ("Turkers") to
evaluate the generated sentences. Given an image and a set of
descriptions from different models, we ask Turkers to rank the
sentences based on correctness, grammar and relevance.
We compared sentences from our model to the ones made
publicly available by [19]. As seen in Table 4, our
fine-tuned (ft) LRCN model performs on par with the Nearest


Method              Correctness   Grammar   Relevance
TreeTalk [24]       4.08          4.35      3.98
OxfordNet [19]      3.71          3.46      3.70
NN [19]             3.44          3.20      3.49
LRCN fc8 (ours)     3.74          3.19      3.72
LRCN ft (ours)      3.47          3.01      3.50
Captions            2.55          3.72      2.59

Table 4: Image description: Human evaluator rankings from 1-6
(low is good) averaged for each method and criterion. We evaluated
on 785 Flickr images selected by the authors of [19] for
the purposes of comparison against this similar contemporary
approach.

Neighbour  (NN)  on  correctness  and  relevance,  and  better 
on  grammar.  We  show  example  sentence  generations  in  Fig¬ 
ure  5. 

6.  Video  description 

In video description we must generate a variable-length
stream of words, similar to Section 5. Several methods have
been proposed for generating sentence descriptions for video
[11, 29, 18, 3, 6, 18, 39, 40], but to our knowledge we present
the first application of deep models to the video description task.

The  LSTM  framework  allows  us  to  model  the  video  as 
a  variable  length  input  stream  as  discussed  in  Section  3. 
However,  due  to  limitations  of  available  video  description 
datasets  we  take  a  different  path.  We  rely  on  more  “tra¬ 
ditional”  activity  and  video  recognition  processing  for  the 
input  and  use  LSTMs  for  generating  a  sentence. 

We  first  distinguish  the  following  architectures  for  video 
description  (see  Figure  4).  For  each  architecture,  we  assume 
we  have  predictions  of  objects,  subjects,  and  verbs  present 
in  the  video  from  a  CRF  based  on  the  full  video  input.  In 
this way, we observe the video as a whole at each time step,
not  incrementally  frame  by  frame. 

(a) LSTM encoder & decoder with CRF max. (Figure 4(a))
The first architecture is motivated by the video description
approach presented in [29]. They first recognize a
semantic representation of the video using the maximum a
posteriori (MAP) estimate of a CRF taking in video features
as unaries. This representation, e.g. (person, cut, cutting
board), is then concatenated to an input sentence (person cut
cutting board) which is translated to a natural sentence (a
person cuts on the board) using phrase-based statistical
machine translation (SMT) [22]. We replace the SMT with an
LSTM, which has shown state-of-the-art performance for
machine translation between languages [37, f ]. The
architecture (shown in Figure 4(a)) has an encoder LSTM
(orange) which encodes the one-hot vector (binary index
vector in a vocabulary) of the input sentence as done in [37].
This allows for variable-length inputs. (Note that the input


Figure 4: Our approaches to video description: (a) LSTM encoder & decoder with CRF max, (b) LSTM decoder with CRF max, (c) LSTM
decoder with CRF probabilities. (For a larger figure, zoom in or see the supplemental material.)


Architecture                        Input      BLEU
SMT [29]                            CRF max    24.9
SMT [28]                            CRF prob   26.9
(a) LSTM Encoder-Decoder (ours)     CRF max    25.3
(b) LSTM Decoder (ours)             CRF max    27.4
(c) LSTM Decoder (ours)             CRF prob   28.8

Table 5: Video description: Results on detailed description of
TACoS multilevel [28], in %; see Section 6 for details.


sentence  might  have  a  different  number  of  words  than  el¬ 
ements  of  the  semantic  representation.)  At  the  end  of  the 
encoder  stage,  the  final  hidden  unit  must  remember  all  nec¬ 
essary  information  before  being  input  into  the  decoder  stage 
(pink)  in  which  the  hidden  representation  is  decoded  into  a 
sentence,  one  word  at  each  time  step.  We  use  the  same  two- 
layer  LSTM  for  encoding  and  decoding. 

(b)  LSTM  decoder  with  CRF  max.  (Figure  4(b))  In 
this variant we exploit the fact that the semantic representation can
be  encoded  as  a  single  fixed  length  vector.  We  provide  the 
entire  visual  input  representation  at  each  time  step  to  the 
LSTM,  analogous  to  how  an  entire  image  is  provided  as  an 
input  to  the  LSTM  in  image  description. 

(c) LSTM decoder with CRF prob. (Figure 4(c)) A
benefit of using LSTMs for machine translation compared
to phrase-based SMT [22] is that they can naturally incorporate
probability vectors during training and test time, which
allows the LSTM to learn uncertainties in visual generation
rather than relying on MAP estimates. The architecture is
the same as in (b), but we replace max predictions with
probability distributions.
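
The difference between the inputs of variants (b) and (c) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-slot probabilities are hypothetical, and a per-slot argmax stands in for the joint MAP estimate of the CRF, which is a simplification.

```python
def one_hot(index, size):
    """Binary index vector with a single 1.0 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def crf_max_input(slot_probs):
    """Variant (b): concatenate a one-hot vector of the top label per slot."""
    out = []
    for probs in slot_probs:
        out.extend(one_hot(probs.index(max(probs)), len(probs)))
    return out

def crf_prob_input(slot_probs):
    """Variant (c): concatenate the full probability distribution per slot."""
    return [p for probs in slot_probs for p in probs]

# Hypothetical CRF probabilities over 3 subjects, 2 verbs and 2 objects.
slot_probs = [[0.7, 0.2, 0.1], [0.4, 0.6], [0.55, 0.45]]

x_max = crf_max_input(slot_probs)    # fixed-length vector fed at every time step
x_prob = crf_prob_input(slot_probs)  # same length, but keeps the uncertainty
```

Both vectors have the same length, so the decoder is unchanged between (b) and (c); only (c) lets it see that, for example, the object prediction was nearly a coin flip.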

6.1.  Evaluation 

We  evaluate  our  approach  on  the  TACoS  multilevel 
[28]  dataset,  which  has  44,762  video/sentence  pairs  (about 
40,000 for training/validation). We compare to [29], who
use max prediction, as well as a variant presented in [28]
which takes CRF probabilities at test time and uses a word


lattice  to  find  an  optimal  sentence  prediction.  Since  we  use 
the  max  prediction  as  well  as  the  probability  scores  pro¬ 
vided  by  [28],  we  have  an  identical  visual  representation. 
[28]  uses  dense  trajectories  [42]  and  SIFT  features  as  well 
as  temporal  context  reasoning  modeled  in  a  CRF. 

Table 5 shows the BLEU-4 score. The results show that
(1) the LSTM outperforms an SMT-based approach to video
description; (2) the simpler decoder architectures (b) and (c)
achieve better performance than (a), likely because the input
does not need to be memorized; and (3) our approach
achieves 28.8%, clearly outperforming the best reported
number of 26.9% on TACoS multilevel by [28].

More  broadly,  these  results  show  that  our  architecture 
is not restricted to deep neural network inputs but can be
cleanly  integrated  with  other  fixed  or  variable  length  inputs 
from  other  vision  systems. 

7.  Conclusions 

We’ve  presented  LRCN,  a  class  of  models  that  is  both 
spatially  and  temporally  deep,  and  has  the  flexibility  to  be 
applied  to  a  variety  of  vision  tasks  involving  sequential 
inputs  and  outputs.  Our  results  consistently  demonstrate 
that  by  learning  sequential  dynamics  with  a  deep  sequence 
model,  we  can  improve  on  previous  methods  which  learn  a 
deep  hierarchy  of  parameters  only  in  the  visual  domain,  and 
on  methods  which  take  a  fixed  visual  representation  of  the 
input  and  only  learn  the  dynamics  of  the  output  sequence. 

As  the  field  of  computer  vision  matures  beyond  tasks 
with  static  input  and  predictions,  we  envision  that  “dou¬ 
bly  deep”  sequence  modeling  tools  like  LRCN  will  soon 
become  central  pieces  of  most  vision  systems,  as  convo¬ 
lutional  architectures  recently  have.  The  ease  with  which 
these  tools  can  be  incorporated  into  existing  visual  recog¬ 
nition  pipelines  makes  them  a  natural  choice  for  perceptual 
problems with time-varying visual input or sequential outputs,
which these methods are able to produce with little
input  preprocessing  and  no  hand-designed  features. 


A female tennis player in action on the court.
A group of young men playing a game of soccer.
A man riding a wave on top of a surfboard.
A baseball game in progress with the batter up to plate.
A brown bear standing on top of a lush green field.
A person holding a cell phone in their hand.
A black and white cat is sitting on a chair.
A large clock mounted to the side of a building.
A bunch of fruit that are sitting on a table.
A close up of a hot dog on a bun.
A bath room with a toilet and a bath tub.
A vase filled with flower sitting on a table.

Figure 5: Sentences generated by our finetuned LRCN model. Images were chosen from the COCO [25] validation set. We used beam
search with a beam size of 5 to generate the sentences, and display the top (highest likelihood) result above.
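
The caption of Figure 5 mentions beam search with a beam size of 5. A minimal sketch of that decoding procedure is given below. The `step_fn` interface and the toy next-word table are hypothetical stand-ins for the trained LRCN; no length normalization is applied, which is a simplifying assumption.

```python
import math

def beam_search(step_fn, beam_size=5, max_len=20, eos="<EOS>"):
    """Return the highest-scoring word sequence found by beam search.

    step_fn(prefix) must return a dict mapping each candidate next word
    to its probability given the prefix (a tuple of words).
    """
    beams = [(0.0, ())]  # (log-probability, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for word, p in step_fn(prefix).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), prefix + (word,)))
        if not candidates:
            break
        # Keep only the beam_size most likely extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((logp, seq))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return list(best[1])

# Hypothetical toy model: probabilities of the next word given the prefix.
TOY = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"cat": 0.5, "dog": 0.5},
    ("the",): {"cat": 0.9, "dog": 0.1},
}

def toy_step(prefix):
    # After two words the toy model always ends the sentence.
    return TOY.get(prefix, {"<EOS>": 1.0})

sentence = beam_search(toy_step, beam_size=5)
greedy = beam_search(toy_step, beam_size=1)
```

In this toy example the wider beam recovers "the cat", whose total probability (0.36) exceeds that of any sentence starting with the locally more likely first word "a" (0.30), which is what a beam size of 1 commits to.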
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